Overview and Motivation

Sepsis is a systemic inflammatory response to infection. The terminology of sepsis is confusing as most modern criterion for sepsis include organ dysfunction, but a 1992 consensus defined “severe sepsis” as sepsis complicated by acute organ dysfunction; “severe sepsis” and sepsis are often used interchangably. [Angus, 2013]

In the hospital, patients used to often managed for sepsis based on “SIRS Criteria.” SIRS, referring to systemic inflammatory response syndrome was defined by physiologic abnormalities: temperature >98C, heart rate >90, respiratory rate that is >20 or arterial CO2 <32mmHg, and an elevated white count above 12,000/mm^3. With this criteria, sepsis was defined as SIRS + a source. This means that a culture proven infection in the context of SIRS defined a patient as having sepsis.

More recently, new criteria have been proposed. The sequential organ failure assessment (SOFA) was proposed and adopted as a prognostication tool for patients in the ICU with a focus on sepsis, but requires substantial invasive measurement. A short form of SOFA, quick SOFA (qSOFA), asks three questions similar to SIRS in order to identify patients who would benefit from a higher level of care. It is worth mentioning that SOFA used to mean “sepsis-related organ failure assessment” rather than “sequeqntial organ failure assessment” in the literature. The latest definition is known as SEPSIS-3, which was developed at the Third International Consensus for Sepsis and Septic Shock.

The murkiness of the definition around sepsis has challenged epidemiological measurement, but researchers such as Angus have attempted to measure the incidence of sepsis using ICD-9 codes that account for infection and organ dysfunction. [Angus, 2001]

In his 2001 paper the found an incidence of 3.0 cases per 1,000 people and 2.26 cases per 100 hospital discharges in 1995. They found a mortality of 28.6% although other studies quote in-hospital sepsis mortality rates at ~50%. [UpToDate]

Regardless, sepsis is a serious disease process, and the use of clinical and laboratory criteria for prognostication remains a challenging problem. The Acute Physiology, Age, and Chronic Health Evaluation (APACHE) system has been popular, with the latest version known as APACHE IV achieving an AUROC of 0.88 per the original paper. SIRS, qSOFA, and SOFA achieved AUROC’s of 0.589, 0.607, and 0.753 for in-hospital mortality in patients primarily hospitalized for an infectious disease process. [Raith, 2017]

Many of these older models are defined based on clinical expertise and consensus meetings, and predictive scores have often been derived using logistic regression. As the previously sited AUROCs suggest, they all leave room for improvement. The goal of this work is to apply modern machine learning techniques to physiologic parameters derived from cinical and laboratory data in an effort to build a better mortality predictor; specifically, rather than using SOFA’s sequential assessment, the goal was to make the prediction using only day from the first day in the ICU.

To accomplish this, the MIMIC-III database was used to develop a cohort of patients identified as having sepsis. Sepsis was defined using the ICD-9 codes developed by Angus to account for infectious processes and organ dysfunction. The outcome, as in the paper from Raith et al. above, was hospital expiration.

Clinical and laboatory data on the first day of the ICU admission were then extracted. After review of the data for errors, an exploratory data analysis was undertaken to understand if these data drawn only on the first day would be associated with the outcome of hospital expiration in order to guide the selection of a predictor set. From their an initial an initial gradient boosted tree model was built using the default parameters described in the XGBoost manual and an iteration count derived from 5-fold cross validation. This initial model was then tuned in a grid search to select the best parameters. The model was then evaluated and sensitivity analysis was perforomed, and the final model is presented for further validation on other data sources.

Initial Questions

  1. What clinical and laboratory variables are associated with hospital mortality?
  2. What is the relationship between these variables? Are some of them redundant? Are there any patterns in the recordings that suggest structure to the data and define groups of patients?
  3. Can a subset of these data be used to build a predictive model of hospital mortality?

Questions that Arose During the Project

Data

The MIMIC-III (Medical Information Mart for Intensive Care) database, a publicly available database maintained by the MIT Laboratory for Computational Physiology that contains health care data on 53,423 distinct hospital admissions at Beth Israel Deaconess Medical Center, was used at the primary data source. [Johnson et al., 2016]

Alistair Johnson, mentioned above in the development of OASIS, is now at the MIT Laboratory of Computational Physiology. In addition to his instrumental work in developing MIMIC-III, he has also developed open source code for extracting various clinical and laboratory paramters from the first day of a patients stay in the ICU from MIMIC-III. His colleague, Tom Pollard, also at the MIT Laboratory for Computational Physiology has developed open source code for selecting sepsis patients from MIMIC-III by various criteria including the Angus criteria mentioned above.

The code can be found here.

These SQL scripts were used to generate the following materialized views:

  1. Angus: a table flagging each admission as meeting Angus “criteria”" for sepsis using the ICD-9 codes for infection and organ failure he published in 2001, and which have been externally validated.
  2. Vitals first day: a view of the min, max, and mean, of important vital sign recordings on the first day in the ICU. These include systolic, diastolic, and mean artieral pressures (MAP), heart rate, respiratory rate, arterial hemoglobin saturation by pulse oximetry, and etc.
  3. GCS first day: a table of the patient’s glasgow coma score on admission to the ICU.
  4. Laboratory first day: a view of the first day lab results for commonly ordered labs for each patient. These include electrolytes (sodium, potassium, chloride, bicarbonate), glucose,white count, number of bands (immature WBC), hemoglobin, hematcrit, lactate, coagulation measures (PT, PTT, INR), and etc.
  5. Arterial blood gas (ABG) first day: ABG, a special type of lab in which blood is drawn from an artery instead of vein, is used to derive accurate measures for oxygen and carbon dioxide levels in the blood, as well as lactate. ABG data is crucial in evaluating a pateints acid-base status, and the values obtained are significantly more informing than venous gas results or data obtained from pulse oximetry.
  6. Urine output first day: Urine output is a marker of fluid status and renal function, both of which are, a priori, highly associated with the level of illness and risk of death.
  7. Renal replacement therapy (RRT) first day: RRT or dialysis as it is more commonly known, is used in patients with severe renal failure and other critical illness contexts in which the kidney’s cannot adequately remove waste products, particularly those associated with protein degradation, nitrogenous waste, and uremia, from the blood. The use of RRT on the first day in the ICU should be highly associated with a patient’s illness burden.
  8. Ventilation first day: ventilation is provided to patients in a variety of critical clinical contexts such as loss of respiratory drive, airway obstruction secondary to process such as asthma or chronic obstructive pulmonary disorder exacerbation, and inability of the patient to protect their airway. As such, the use of ventilation on the first day should be associated with prognosis.
  9. Elixhauser: the Elixhauser score is a system derived using diagnosis codes to estimate the burden of comorbidity a patient has. It is derived in MIMIC-III using avilable diagnosis codes.
  10. ICU stay detail: a view that contains demographic data about patients for each ICU stay, as well as whether or not they died in the hospital (hospital_expire_flag).

There is also the table of diagnoses by ICD-9 codes which was not used for model building but was examined briefly during exploratory data anlysis. The associated concept for ICD-9 codes (d_icd_diagnoses) was used to annotate these as needed.

Because the evaluators of this work will not have access to my local server environment, the above views were generated and then exported from PostgreSQL as CSV files. They will be loaded in now. Of note, some data are available in multiple places; e.g. glucose is present in both vitals and labs, lactate is present a single recording for the blood gas data whereas ranges for it are extracted in the laboratory pull. Variables will only be included once, and concepts like glucose and lactate which are best summarized in the labs table will be drawn from there instead of other souces. Also of note, unlike the other first day tables, which are summary recordings for all first day data, the ABG view contains all recorded ABGs for a patient as seperate observations. Because we want a measure predictor of disease severity it may seem obvious to take the so-called “worst” value. However, making this decision can introduce significant bias as pO2 and pCO2 are parameters which we would expect to be associated with illness status parabolically. That is to say, a very high pCO2 may indicate respiratory collapse, but a very low pCO2 will signify tachypnea which may be secondary to a variety of contexts of varying criticality. As such, a measure of central tendency will be applied. We would expect pCO2 to be fairly normally distributed as very low and very high values are associated with severe acid-base distrubances which are highly associated with death. pO2 however is less likely to obey this, and because of metabolic adaptations we should expect skew in all parameters. As such, the first recorded gas by chart time will be used. This is expected to introduce less bias as only non-differental information bias is potentially introduced.

Lastly, some patients have many admissions to the ICU, and the consensus at the MIT Laboratory for Computatonal Physiology is that building models off ICU stays after the first one can be misleading as a patient is already fairly “run down” after a single ICU course. As such, mortality predictions will be based off the first stay only.

ICU Stay and Angus Criteria

detail <- read_csv("./data/icustay_detail.csv") %>% 
  select(icustay_id, hadm_id, subject_id, hadm_id, age, gender, ethnicity, los_icu, 
         intime, hospital_expire_flag) %>% 
  arrange(intime) %>% 
  distinct(hadm_id, .keep_all = TRUE)
angus <- read_csv("./data/angus_sepsis.csv") %>% 
  select(subject_id, hadm_id, angus)

Vital Signs

vitals <- read_csv("./data/vitals.csv") %>% 
  select(-glucose_mean, -glucose_min, -glucose_max)

Fluid Status and Kidney Function

rrt <- read_csv("./data/rrt.csv")
uo <- read_csv("./data/uo.csv")

Ventilation Status

vent <- read_csv("./data/ventfirst.csv")

Laboratory Values

labs <- read_csv("./data/labs.csv")
abg_first <- read_csv("./data/abg.csv") %>% 
  select(icustay_id, po2, pco2, ph, charttime) %>%
  arrange(charttime) %>% 
  distinct(icustay_id, .keep_all = TRUE) %>% select(-charttime)

Mental Status and Comorbidity Burden

gcs <- read_csv("./data/gcs.csv") %>% select(subject_id, hadm_id, icustay_id, 
                                             mingcs)
# Note: Elixhauster is a set of discrete categories for which a point is given;
# and the sum is the score.
elix <- read_csv("./data/elixhauser.csv") %>% 
  select(hadm_id, elixhauser = elixhauser_vanwalraven)

Formation of the Cohort Dataset

To build a cohort, we’ll begin by combining the angus and detail tables. This will done as an inner join: patients without an Angus status of 0 or 1 will not be included. From this base cohort, we’ll pull in all of the data we loaded above. Because we only kept the first ICU stay ID for each admission, a subject should only appear twice if they have been re-admitted. It is fine for a two admissions to appear, and we make the assumption that they represent distinct events leading to ICU admission although this may not hold in a small subset of the patients. We also make the assumption that previous admission data are uncorrelated with current admission data. This is not necessarily true, but mirror’s real life, and since this tool would be used per admission it seems reasonable to include both admissions as long as we use the first ICU stay’s data per admission.

We then apply the following exclusion criteria: 1) Age <16; our predictions are targeting adults only 2) LOS >4 hours (0.167 days); this excludes patients admitted to the ICU by mistake or for non-critical illness such as post-operatively. 3) We only keep those with angus equal to 1, as we only want to examine patients with sepsis by the Angus ICD-9 codes.

cohort <- inner_join(angus, detail, by = c("hadm_id", "subject_id")) %>% 
  left_join(vitals, by = c("icustay_id", "hadm_id", "subject_id")) %>%
  left_join(gcs, by = c("icustay_id", "hadm_id", "subject_id")) %>%
  left_join(labs, by = c("icustay_id", "hadm_id", "subject_id")) %>%
  left_join(abg_first, by = c("icustay_id")) %>%
  left_join(uo, by = c("icustay_id", "hadm_id", "subject_id")) %>%
  left_join(rrt, by = c("icustay_id", "hadm_id", "subject_id")) %>%
  left_join(vent, by = c("icustay_id", "hadm_id", "subject_id")) %>%
  left_join(elix, by = c("hadm_id")) %>% 
  filter(angus == 1, los_icu > 0.167, age > 16 ) %>%
  select(-angus, -icustay_id, -intime, -los_icu) %>%
  select(subject_id, hadm_id, hospital_expire_flag, everything())

To confirm that our joins worked as expected, there should a cohort in which a single subject can appear twice, but there is only one ICU admission per admission to the hospital, and a patient should obviously only die once.

We can examine the first two quite simply; for expiration, we are performing a “sanity check.” Because the flag is 0 or 1, and because there can be multiple admissions per subject, the sum of the flag should be 0 or 1, but never more. We will use this idea to check if anyone has been coded as dying twice.

# First two checks
cohort %>% group_by(subject_id) %>% summarise(count = n()) %>% arrange(-count)
cohort %>% group_by(hadm_id) %>% summarise(count = n()) %>% arrange(-count)
# Expiration sanity check
cohort %>% group_by(subject_id) %>% summarise(expiration_sanity_check = sum(hospital_expire_flag)) %>% arrange(-expiration_sanity_check)

All of the checks suggest our dataset is okay. We no longer need subject_id’s as a subject can be present twice, but rather the admission ID (hadm_id), since this is a model for predicting mortality on admission.

Ultimately, we build a gradient boosted tree model using XGBoost. As such, we need to convert the predictors to numeric. Almost all are except for gender, which we can code as 1 or male and 0 for female. We also need to recode ethncity to make it less granular, and the dummy code the variables. Lastly, we are bound to have some ages in the cohort which are ~300. This is because HIPPA regulations require the actual age of patients of >89 years old to be hidden since, given the small size of this population, there is a risk of data leakage. The MIMIC-III guidelines say to shift any ages with these values to 91.4, the median for patients above 89 in the dataset.

cohort <- cohort %>% mutate(male_gender = as.numeric(cohort$gender == "M")) %>%
  select(-gender)

ethnicities <- c("WHITE", "BLACK", "HISPANIC", "ASIAN")
for (i in 1:length(ethnicities)) {
  cohort$ethnicity[str_detect(cohort$ethnicity, ethnicities[i])] <- ethnicities[i]
}
cohort$ethnicity[!(cohort$ethnicity %in% ethnicities)] <- "OTHER"
cohort <- cohort %>% mutate(white_eth = ifelse(ethnicity == "WHITE", 1, 0),
                            black_eth = ifelse(ethnicity == "BLACK", 1, 0),
                            hispanic_eth = ifelse(ethnicity == "HISPANIC", 1, 0),
                            asian_eth = ifelse(ethnicity == "ASIAN", 1, 0),
                            other_eth = ifelse(ethnicity == "OTHER", 1, 0)) %>%
  select(-subject_id) %>%
  select(ethnicity, hadm_id, hospital_expire_flag, male_gender, white_eth, black_eth,  # we keep ethnicity here for EDA,
         hispanic_eth, asian_eth, other_eth, vent, rrt, elixhauser, mingcs, age,       # but will remove later since dummy coded
         everything())  

# Note that I used the select statement above to reorder the data. I've grouped
# variables together to the left of the continuous variables which are together.
# This is useful when you want to use subsetting and only want to look at certian
# variable types.

cohort$age[cohort$age > 89] <- 91.4
glimpse(cohort)
## Observations: 14,938
## Variables: 77
## $ ethnicity            <chr> "WHITE", "WHITE", "WHITE", "WHITE", "WHIT...
## $ hadm_id              <int> 128652, 165660, 185910, 101757, 174486, 1...
## $ hospital_expire_flag <int> 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,...
## $ male_gender          <dbl> 1, 1, 1, 1, 1, 1, 1, 1, 0, 1, 0, 1, 1, 1,...
## $ white_eth            <dbl> 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 1, 1, 1, 1,...
## $ black_eth            <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,...
## $ hispanic_eth         <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,...
## $ asian_eth            <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,...
## $ other_eth            <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0,...
## $ vent                 <int> 1, 1, 0, 1, 1, 0, 0, 1, 1, 0, 0, 0, 0, 0,...
## $ rrt                  <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,...
## $ elixhauser           <int> 9, 7, 7, 3, 27, 15, 39, 31, 10, 10, 1, 6,...
## $ mingcs               <int> 15, 15, 15, 15, 15, 8, 15, 15, 10, 15, 14...
## $ age                  <dbl> 72.2671, 72.7284, 75.9415, 56.6349, 62.71...
## $ heartrate_min        <dbl> 48, 67, 91, 73, 58, 63, 74, 66, 60, 81, 6...
## $ heartrate_max        <int> 54, 93, 129, 114, 69, 74, 86, 85, 78, 100...
## $ heartrate_mean       <dbl> 51.23077, 81.48000, 105.58537, 92.23810, ...
## $ sysbp_min            <int> 85, 84, 67, 98, 66, 96, 83, 84, 107, 95, ...
## $ sysbp_max            <int> 146, 139, 114, 153, 172, 144, 118, 121, 1...
## $ sysbp_mean           <dbl> 114.15385, 108.04167, 94.30000, 127.13636...
## $ diasbp_min           <int> 48, 48, 38, 49, 40, 34, 41, 38, 46, 54, 2...
## $ diasbp_max           <int> 74, 84, 76, 79, 95, 80, 89, 83, 72, 79, 7...
## $ diasbp_mean          <dbl> 59.92308, 62.16667, 62.35000, 64.36364, 6...
## $ meanbp_min           <dbl> 63.0000, 58.0000, 53.3333, 81.0000, 48.00...
## $ meanbp_max           <dbl> 99.0000, 98.0000, 88.0000, 100.0000, 189....
## $ meanbp_mean          <dbl> 79.92308, 72.54167, 73.00001, 93.61112, 8...
## $ resprate_min         <int> 12, 13, 14, 11, 13, 11, 16, 16, 9, 15, 13...
## $ resprate_max         <int> 13, 23, 31, 27, 27, 21, 28, 21, 28, 23, 2...
## $ resprate_mean        <dbl> 12.06667, 18.65517, 24.32500, 17.50000, 1...
## $ tempc_min            <dbl> 36.83334, 36.11111, 34.61111, 36.66667, 3...
## $ tempc_max            <dbl> 37.05555, 37.88889, 37.88889, 38.66667, 3...
## $ tempc_mean           <dbl> 36.95833, 37.16667, 36.52222, 37.12346, 3...
## $ spo2_min             <int> 97, 85, 68, 93, 96, 97, 93, 100, 96, 94, ...
## $ spo2_max             <int> 100, 98, 100, 100, 100, 100, 100, 100, 10...
## $ spo2_mean            <dbl> 98.71429, 94.20000, 96.83784, 98.18182, 9...
## $ aniongap_min         <int> 9, 9, 16, 9, 13, 15, 11, 7, 11, 14, 16, 1...
## $ aniongap_max         <int> 10, 11, 17, 12, 16, 17, 17, 10, 11, 17, 1...
## $ albumin_min          <dbl> 2.7, 2.8, 2.7, NA, 3.1, 3.4, 2.4, NA, 3.4...
## $ albumin_max          <dbl> 2.7, 2.8, 2.7, NA, 3.1, 3.4, 2.9, NA, 3.4...
## $ bands_min            <int> NA, NA, NA, 30, NA, NA, NA, NA, NA, NA, 6...
## $ bands_max            <int> NA, NA, NA, 30, NA, NA, NA, NA, NA, NA, 6...
## $ bicarbonate_min      <int> 24, 24, 20, 26, 16, 20, 14, 30, 23, 15, 1...
## $ bicarbonate_max      <int> 27, 28, 21, 26, 19, 25, 20, 31, 23, 19, 2...
## $ bilirubin_min        <dbl> 0.9, NA, 1.3, NA, 0.6, 0.5, 0.1, 0.2, 0.3...
## $ bilirubin_max        <dbl> 0.9, NA, 1.3, NA, 0.6, 0.5, 0.2, 0.2, 0.3...
## $ creatinine_min       <dbl> 0.7, 0.9, 1.8, 0.4, 0.9, 1.0, 0.9, 0.9, 0...
## $ creatinine_max       <dbl> 0.9, 1.0, 2.6, 0.5, 1.0, 1.0, 1.9, 1.1, 0...
## $ chloride_min         <int> 95, 106, 106, 104, 113, 104, 126, 101, 11...
## $ chloride_max         <int> 103, 109, 107, 107, 114, 105, 140, 111, 1...
## $ glucose_min          <int> 101, 101, 94, 57, 72, 104, 47, 113, 131, ...
## $ glucose_max          <int> 116, 126, 137, 77, 187, 144, 369, 235, 26...
## $ hematocrit_min       <dbl> 30.0, 32.7, 36.0, 28.5, 25.0, 30.0, 25.0,...
## $ hematocrit_max       <dbl> 30.0, 36.5, 36.0, 33.0, 39.2, 33.2, 31.5,...
## $ hemoglobin_min       <dbl> 10.6, 11.5, 12.5, 9.9, 8.4, 10.5, 7.8, 8....
## $ hemoglobin_max       <dbl> 10.6, 12.7, 12.5, 11.0, 13.6, 11.2, 10.1,...
## $ lactate_min          <dbl> 1.4, 0.7, 2.8, 1.1, 2.1, NA, 1.5, 2.6, 1....
## $ lactate_max          <dbl> 1.5, 1.0, 2.8, 1.1, 2.1, NA, 1.5, 4.5, 2....
## $ platelet_min         <int> 109, 146, 76, 177, 66, 91, 67, 69, 203, 1...
## $ platelet_max         <int> 184, 174, 76, 189, 182, 97, 109, 76, 227,...
## $ potassium_min        <dbl> 3.3, 3.9, 4.2, 3.7, 4.2, 3.9, 2.6, 3.3, 2...
## $ potassium_max        <dbl> 3.7, 4.2, 4.4, 4.1, 5.7, 4.0, 3.7, 4.1, 4...
## $ ptt_min              <dbl> 28.5, 28.0, 28.2, 23.7, 22.9, 29.3, 28.4,...
## $ ptt_max              <dbl> 29.4, 61.7, 50.6, 29.5, 48.1, 30.1, 29.8,...
## $ inr_min              <dbl> 1.1, 1.2, 1.4, 0.9, 1.1, 1.2, 1.3, 1.4, 1...
## $ inr_max              <dbl> 1.2, 1.4, 1.7, 1.3, 1.4, 1.2, 1.4, 1.4, 1...
## $ pt_min               <dbl> 13.0, 14.2, 14.4, 11.5, 13.0, 13.8, 15.2,...
## $ pt_max               <dbl> 13.2, 16.2, 15.6, 13.7, 15.3, 13.9, 15.9,...
## $ sodium_min           <int> 128, 139, 139, 138, 139, 138, 150, 145, 1...
## $ sodium_max           <int> 133, 142, 140, 138, 142, 140, 166, 146, 1...
## $ bun_min              <int> 11, 19, 42, 18, 32, 11, 43, 23, 20, 47, 5...
## $ bun_max              <int> 13, 22, 55, 18, 37, 17, 77, 26, 23, 59, 5...
## $ wbc_min              <dbl> 6.9, 10.9, 18.2, 4.5, 4.8, 4.2, 2.6, 2.3,...
## $ wbc_max              <dbl> 6.9, 15.3, 18.2, 15.9, 13.1, 4.5, 5.0, 2....
## $ po2                  <int> 148, 59, NA, 142, 122, NA, NA, 267, 53, N...
## $ pco2                 <int> 36, 50, NA, 42, 40, NA, NA, 87, 57, NA, N...
## $ ph                   <dbl> 7.47, 7.33, NA, 7.40, 7.32, NA, NA, 7.18,...
## $ urineoutput          <dbl> 3345, 2030, 560, 2305, 1990, 2330, 1225, ...

We don’t want to exclude missing data en masse: lots of specifc tests, e.g. bands which traditionally need to requested on a complete blood count (CBC), will be missing. In additin, gradient boosting can technically handle missing data in tree construction

However, we do want to exclude values which are clinically impossible. This is not to say incompatible with life (e.g. BP = 0), which would certainly expect in the critically ill at times. The key point here being that the tails of distributions for many of these parameters are real and clinically relevant and simple exclusion of outliers by say Tukey’s rule will only bias our results. So, we will carefully apply clinical knowledge to these data with a very conversative approach to discarding data (only excluding things we are certain could not be real). We will also examine how the missing data are distributed by variable to discern whether such variables should be excluded from model buidling procedures.

summary(cohort)
##   ethnicity            hadm_id       hospital_expire_flag  male_gender    
##  Length:14938       Min.   :100011   Min.   :0.0000       Min.   :0.0000  
##  Class :character   1st Qu.:125209   1st Qu.:0.0000       1st Qu.:0.0000  
##  Mode  :character   Median :149858   Median :0.0000       Median :1.0000  
##                     Mean   :149839   Mean   :0.2133       Mean   :0.5311  
##                     3rd Qu.:174835   3rd Qu.:0.0000       3rd Qu.:1.0000  
##                     Max.   :199999   Max.   :1.0000       Max.   :1.0000  
##                                                                           
##    white_eth        black_eth       hispanic_eth       asian_eth      
##  Min.   :0.0000   Min.   :0.0000   Min.   :0.00000   Min.   :0.00000  
##  1st Qu.:0.0000   1st Qu.:0.0000   1st Qu.:0.00000   1st Qu.:0.00000  
##  Median :1.0000   Median :0.0000   Median :0.00000   Median :0.00000  
##  Mean   :0.7257   Mean   :0.1014   Mean   :0.02999   Mean   :0.02497  
##  3rd Qu.:1.0000   3rd Qu.:0.0000   3rd Qu.:0.00000   3rd Qu.:0.00000  
##  Max.   :1.0000   Max.   :1.0000   Max.   :1.00000   Max.   :1.00000  
##                                                                       
##    other_eth          vent            rrt            elixhauser    
##  Min.   :0.000   Min.   :0.000   Min.   :0.00000   Min.   :-14.00  
##  1st Qu.:0.000   1st Qu.:0.000   1st Qu.:0.00000   1st Qu.:  5.00  
##  Median :0.000   Median :1.000   Median :0.00000   Median : 11.00  
##  Mean   :0.118   Mean   :0.525   Mean   :0.06172   Mean   : 11.57  
##  3rd Qu.:0.000   3rd Qu.:1.000   3rd Qu.:0.00000   3rd Qu.: 17.00  
##  Max.   :1.000   Max.   :1.000   Max.   :1.00000   Max.   : 54.00  
##                                                                    
##      mingcs           age        heartrate_min    heartrate_max  
##  Min.   : 3.00   Min.   :17.00   Min.   :  0.35   Min.   : 41.0  
##  1st Qu.:14.00   1st Qu.:55.83   1st Qu.: 62.00   1st Qu.: 92.0  
##  Median :15.00   Median :68.78   Median : 72.00   Median :106.0  
##  Mean   :13.51   Mean   :66.90   Mean   : 73.53   Mean   :107.7  
##  3rd Qu.:15.00   3rd Qu.:79.95   3rd Qu.: 84.00   3rd Qu.:122.0  
##  Max.   :15.00   Max.   :91.40   Max.   :137.00   Max.   :280.0  
##  NA's   :171                     NA's   :148      NA's   :149    
##  heartrate_mean     sysbp_min        sysbp_max       sysbp_mean   
##  Min.   : 34.84   Min.   :  1.00   Min.   : 46.0   Min.   : 46.0  
##  1st Qu.: 76.31   1st Qu.: 77.00   1st Qu.:130.0   1st Qu.:103.6  
##  Median : 87.46   Median : 87.00   Median :145.0   Median :112.7  
##  Mean   : 88.46   Mean   : 87.88   Mean   :148.4   Mean   :115.8  
##  3rd Qu.: 99.68   3rd Qu.: 98.00   3rd Qu.:163.0   3rd Qu.:125.9  
##  Max.   :154.97   Max.   :172.00   Max.   :300.0   Max.   :208.0  
##  NA's   :148      NA's   :159      NA's   :157     NA's   :157    
##    diasbp_min       diasbp_max     diasbp_mean       meanbp_min    
##  Min.   :  1.00   Min.   : 31.0   Min.   : 20.72   Min.   :  0.20  
##  1st Qu.: 34.00   1st Qu.: 71.0   1st Qu.: 51.69   1st Qu.: 48.00  
##  Median : 41.00   Median : 81.0   Median : 58.09   Median : 55.67  
##  Mean   : 41.36   Mean   : 83.8   Mean   : 59.00   Mean   : 54.99  
##  3rd Qu.: 49.00   3rd Qu.: 93.0   3rd Qu.: 65.23   3rd Qu.: 63.00  
##  Max.   :104.00   Max.   :287.0   Max.   :132.33   Max.   :119.00  
##  NA's   :159      NA's   :158     NA's   :158      NA's   :149     
##    meanbp_max     meanbp_mean      resprate_min    resprate_max  
##  Min.   : 29.0   Min.   : 29.00   Min.   : 1.00   Min.   : 8.00  
##  1st Qu.: 88.0   1st Qu.: 67.93   1st Qu.:10.00   1st Qu.:24.00  
##  Median :100.0   Median : 74.21   Median :13.00   Median :28.00  
##  Mean   :105.2   Mean   : 75.76   Mean   :13.07   Mean   :28.58  
##  3rd Qu.:113.0   3rd Qu.: 82.22   3rd Qu.:16.00   3rd Qu.:32.00  
##  Max.   :299.0   Max.   :153.84   Max.   :36.00   Max.   :69.00  
##  NA's   :149     NA's   :149      NA's   :164     NA's   :159    
##  resprate_mean     tempc_min       tempc_max       tempc_mean   
##  Min.   : 7.00   Min.   :15.00   Min.   :32.20   Min.   :31.96  
##  1st Qu.:16.80   1st Qu.:35.61   1st Qu.:37.00   1st Qu.:36.39  
##  Median :19.36   Median :36.11   Median :37.50   Median :36.82  
##  Mean   :19.94   Mean   :36.08   Mean   :37.59   Mean   :36.86  
##  3rd Qu.:22.54   3rd Qu.:36.61   3rd Qu.:38.17   3rd Qu.:37.31  
##  Max.   :43.78   Max.   :39.72   Max.   :42.78   Max.   :40.10  
##  NA's   :159     NA's   :334     NA's   :334     NA's   :334    
##     spo2_min         spo2_max        spo2_mean       aniongap_min 
##  Min.   :  2.00   Min.   : 27.00   Min.   : 11.17   Min.   : 1.0  
##  1st Qu.: 89.00   1st Qu.:100.00   1st Qu.: 95.92   1st Qu.:11.0  
##  Median : 93.00   Median :100.00   Median : 97.42   Median :13.0  
##  Mean   : 90.42   Mean   : 99.56   Mean   : 96.95   Mean   :13.4  
##  3rd Qu.: 95.00   3rd Qu.:100.00   3rd Qu.: 98.70   3rd Qu.:15.0  
##  Max.   :100.00   Max.   :100.00   Max.   :100.00   Max.   :45.0  
##  NA's   :169      NA's   :170      NA's   :169      NA's   :118   
##   aniongap_max    albumin_min     albumin_max      bands_min     
##  Min.   : 4.00   Min.   :1.000   Min.   :1.000   Min.   : 1.000  
##  1st Qu.:14.00   1st Qu.:2.400   1st Qu.:2.500   1st Qu.: 2.000  
##  Median :16.00   Median :2.900   Median :3.000   Median : 6.000  
##  Mean   :16.89   Mean   :2.904   Mean   :3.003   Mean   : 9.204  
##  3rd Qu.:19.00   3rd Qu.:3.400   3rd Qu.:3.500   3rd Qu.:13.000  
##  Max.   :56.00   Max.   :5.700   Max.   :5.700   Max.   :69.000  
##  NA's   :118     NA's   :7294    NA's   :7294    NA's   :11425   
##    bands_max     bicarbonate_min bicarbonate_max bilirubin_min  
##  Min.   : 1.00   Min.   : 5.0    Min.   : 5.00   Min.   : 0.10  
##  1st Qu.: 3.00   1st Qu.:18.0    1st Qu.:21.00   1st Qu.: 0.40  
##  Median : 8.00   Median :22.0    Median :24.00   Median : 0.60  
##  Mean   :11.63   Mean   :21.6    Mean   :24.65   Mean   : 1.92  
##  3rd Qu.:16.00   3rd Qu.:25.0    3rd Qu.:27.00   3rd Qu.: 1.30  
##  Max.   :79.00   Max.   :50.0    Max.   :53.00   Max.   :79.00  
##  NA's   :11424   NA's   :74      NA's   :73      NA's   :5649   
##  bilirubin_max    creatinine_min   creatinine_max    chloride_min
##  Min.   : 0.100   Min.   : 0.100   Min.   : 0.100   Min.   : 61  
##  1st Qu.: 0.400   1st Qu.: 0.700   1st Qu.: 0.900   1st Qu.: 98  
##  Median : 0.700   Median : 1.100   Median : 1.300   Median :102  
##  Mean   : 2.229   Mean   : 1.601   Mean   : 1.949   Mean   :102  
##  3rd Qu.: 1.600   3rd Qu.: 1.800   3rd Qu.: 2.300   3rd Qu.:106  
##  Max.   :82.800   Max.   :28.000   Max.   :29.600   Max.   :145  
##  NA's   :5649     NA's   :56       NA's   :56       NA's   :53   
##   chloride_max    glucose_min     glucose_max     hematocrit_min 
##  Min.   : 67.0   Min.   :  7.0   Min.   :  37.0   Min.   : 7.30  
##  1st Qu.:103.0   1st Qu.: 88.0   1st Qu.: 125.0   1st Qu.:25.20  
##  Median :107.0   Median :106.0   Median : 158.0   Median :28.80  
##  Mean   :107.1   Mean   :112.7   Mean   : 185.4   Mean   :29.19  
##  3rd Qu.:111.0   3rd Qu.:129.0   3rd Qu.: 210.0   3rd Qu.:33.00  
##  Max.   :198.0   Max.   :576.0   Max.   :1746.0   Max.   :61.00  
##  NA's   :50      NA's   :45      NA's   :45       NA's   :62     
##  hematocrit_max  hemoglobin_min   hemoglobin_max   lactate_min    
##  Min.   :16.40   Min.   : 2.300   Min.   : 4.80   Min.   : 0.100  
##  1st Qu.:30.20   1st Qu.: 8.400   1st Qu.: 9.90   1st Qu.: 1.075  
##  Median :33.90   Median : 9.600   Median :11.20   Median : 1.500  
##  Mean   :34.54   Mean   : 9.782   Mean   :11.43   Mean   : 1.831  
##  3rd Qu.:38.10   3rd Qu.:11.100   3rd Qu.:12.70   3rd Qu.: 2.100  
##  Max.   :64.10   Max.   :19.800   Max.   :21.70   Max.   :24.200  
##  NA's   :62      NA's   :78       NA's   :78      NA's   :3567    
##   lactate_max      platelet_min     platelet_max    potassium_min  
##  Min.   : 0.300   Min.   :   5.0   Min.   :   9.0   Min.   :0.600  
##  1st Qu.: 1.500   1st Qu.: 120.0   1st Qu.: 156.0   1st Qu.:3.400  
##  Median : 2.300   Median : 187.0   Median : 228.0   Median :3.800  
##  Mean   : 3.195   Mean   : 205.5   Mean   : 250.7   Mean   :3.791  
##  3rd Qu.: 3.800   3rd Qu.: 266.0   3rd Qu.: 316.0   3rd Qu.:4.200  
##  Max.   :32.000   Max.   :1592.0   Max.   :1608.0   Max.   :9.200  
##  NA's   :3567     NA's   :89       NA's   :89       NA's   :42     
##  potassium_max       ptt_min          ptt_max          inr_min      
##  Min.   : 1.900   Min.   : 14.40   Min.   : 14.40   Min.   : 0.500  
##  1st Qu.: 4.100   1st Qu.: 26.10   1st Qu.: 28.40   1st Qu.: 1.100  
##  Median : 4.500   Median : 29.90   Median : 34.30   Median : 1.300  
##  Mean   : 4.714   Mean   : 32.85   Mean   : 46.14   Mean   : 1.487  
##  3rd Qu.: 5.100   3rd Qu.: 35.50   3rd Qu.: 47.60   3rd Qu.: 1.500  
##  Max.   :19.600   Max.   :150.00   Max.   :150.00   Max.   :22.800  
##  NA's   :42       NA's   :1303     NA's   :1303     NA's   :1251    
##     inr_max           pt_min           pt_max         sodium_min   
##  Min.   : 0.800   Min.   :  8.00   Min.   :  9.10   Min.   : 95.0  
##  1st Qu.: 1.200   1st Qu.: 13.00   1st Qu.: 13.50   1st Qu.:134.0  
##  Median : 1.400   Median : 14.10   Median : 15.10   Median :137.0  
##  Mean   : 1.873   Mean   : 15.87   Mean   : 18.52   Mean   :136.5  
##  3rd Qu.: 1.800   3rd Qu.: 16.20   3rd Qu.: 18.60   3rd Qu.:140.0  
##  Max.   :48.200   Max.   :150.00   Max.   :150.00   Max.   :174.0  
##  NA's   :1251     NA's   :1248     NA's   :1248     NA's   :45     
##    sodium_max       bun_min          bun_max          wbc_min      
##  Min.   :108.0   Min.   :  1.00   Min.   :  2.00   Min.   :  0.10  
##  1st Qu.:137.0   1st Qu.: 15.00   1st Qu.: 18.00   1st Qu.:  6.80  
##  Median :140.0   Median : 24.00   Median : 29.00   Median :  9.90  
##  Mean   :140.3   Mean   : 31.26   Mean   : 37.03   Mean   : 11.34  
##  3rd Qu.:143.0   3rd Qu.: 40.00   3rd Qu.: 48.00   3rd Qu.: 13.90  
##  Max.   :182.0   Max.   :254.00   Max.   :272.00   Max.   :575.80  
##  NA's   :44      NA's   :56       NA's   :56       NA's   :92      
##     wbc_max            po2            pco2              ph       
##  Min.   :  0.10   Min.   : 16    Min.   :  8.00   Min.   :6.400  
##  1st Qu.:  9.20   1st Qu.: 77    1st Qu.: 34.00   1st Qu.:7.280  
##  Median : 13.10   Median :114    Median : 41.00   Median :7.370  
##  Mean   : 15.25   Mean   :165    Mean   : 44.15   Mean   :7.347  
##  3rd Qu.: 18.50   3rd Qu.:219    3rd Qu.: 49.00   3rd Qu.:7.430  
##  Max.   :846.70   Max.   :797    Max.   :193.00   Max.   :7.890  
##  NA's   :92       NA's   :4293   NA's   :4294     NA's   :4295   
##   urineoutput   
##  Min.   :-2600  
##  1st Qu.:  810  
##  Median : 1420  
##  Mean   : 1692  
##  3rd Qu.: 2282  
##  Max.   :45680  
##  NA's   :739

From this we can make the following observations:

  • Heart rate min, mean, and max are all clinically possible values.
  • The blood pressure values are all clinically possible.
  • The respiratory rates are all clinically possible.
  • The temperatures are all clinically possible (e.g. severe hypothermia).
  • The saturation values are reasonable as they are constrained by 0 and 100.
  • The lab values appear to be possible in most cases. For example the very low sodium values can be seen with intracranial masses, potassium levels can get very high or low from diuretic use, and etc.
  • Specifically some of the WBC look very high, but perusal of the top patient’s records show they stayed at this level, and thus likely had a leukemia.
  • The bilirubin values that appear extreme can occur in processes which obstruct the biliary tree such as pancreatic adenocarcinoma, or in processes which result in liver damaga such acute/chonic hepatitis or cirrhosis.
  • The high platelet counts can also be encountered in processes like essential thrombocytosis. Very low counts can be see in consumptive processes such as thrombotic thrombocytopenic purpura.
  • Urine output appears to have a negative number which is impossible:
cohort %>% select(urineoutput) %>% arrange(urineoutput)

There are 5 below 0; these numbers can either be zeroed or the patients can be removed. As there are ~15k admissions, it is less biased to toss the patients then to plug in a value.

cohort <- cohort %>% filter(urineoutput > 0)
  • There are some very high lactate values, but even personally I’ve see extreme lactate values in the ICU during shock recussitations and events such as bowel infarction.

  • The bloog gas values, too, are clinically reasonable. pO2 can get unresonably high in real life when FiO2 is set very high, and pCO2 appears constrained to reasonable values. pH is contrained from 6.4 (very acidic) to 7.8 (very basic), but both can occur during shock states or poisonings as respective examples.

Overall, there are bound to be data entry errors in this data, but I am more comfortable with non-differential information bias that hurts model fitting than picking and choosing what stays and introducing a selection bias.

With this basic review of the dataset complete, with most data kept given the clinically possibility of occuring, we are now ready to explore the data.

Exploratory Data Analysis (EDA)

We will begin EDA by asking for simple questions of our data that get to the underlying ideas expressed in our Initial Questions section.

1) What are our cohort demographics?

Sex and Age

mean(cohort$age)
## [1] 66.91273
mean(cohort$male_gender)
## [1] 0.5307556
cohort %>% ggplot() + geom_bar(position = "fill", aes(x = 0, fill = as.factor(male_gender))) + 
  coord_flip() + xlab("") + ylab("Proportion") +
  theme(axis.title.y = element_blank(), axis.text.y = element_blank(), 
        axis.ticks.y = element_blank(), 
        aspect.ratio = .15 ) + 
  scale_fill_discrete(labels = c("Male", "Female"), name = "Gender") +
  ggtitle("Gender Proportion")

Ethnicity?

cohort %>% group_by(ethnicity) %>% summarise(pcount = n()) %>% ungroup() %>%
  mutate(Proportion = round(pcount / sum(pcount), 2)) %>% select(-pcount, Ethnicity = ethnicity) %>% knitr::kable()
Ethnicity Proportion
ASIAN 0.03
BLACK 0.09
HISPANIC 0.03
OTHER 0.12
WHITE 0.73
cohort %>% ggplot() + geom_bar(aes(x = as.factor(ethnicity))) + 
  xlab("Ethnicity") + ggtitle("Ethnicity vs. Death") + 
  scale_fill_brewer(palette = "Set1", labels = c("Survived", "Expired"), name = "Status")

2. How many patients died? Does death vary by demography?

mean(cohort$hospital_expire_flag)
## [1] 0.2065981
cohort %>% group_by(male_gender) %>% summarise(prop_died = mean(hospital_expire_flag)) %>%
  ggplot() + geom_col(aes(x = as.factor(male_gender), y = prop_died))

# Better graph for making website
cohort %>% group_by(male_gender) %>% ggplot() + 
  geom_bar(position = "fill", aes(x = as.factor(male_gender), fill = as.factor(hospital_expire_flag))) + 
  coord_flip() + xlab("Gender") + ylab("Proportion") + scale_x_discrete(labels = c("Female", "Male")) +
  theme(aspect.ratio = .15 ) +
  ggtitle("Gender vs. Death") + scale_fill_brewer(palette = "Set1", labels = c("Survived", "Expired"), name = "Status")

There is only a slight gender difference, contrary to many other disease processes which often have gender differentials.

cohort %>% ggplot() + geom_bar(aes(x = as.factor(ethnicity), fill = as.factor(hospital_expire_flag))) + 
  xlab("Ethnicity") + scale_fill_discrete(labels = c("Survived", "Expired"), name = "Status") +
  ggtitle("Ethnicity vs. Death") + scale_fill_brewer(palette = "Set1", labels = c("Survived", "Expired"), name = "Status")
## Scale for 'fill' is already present. Adding another scale for 'fill',
## which will replace the existing scale.

Compared to all other ethnicities, those identifying as Hispanic had a lower proprtion of death.

Asians had a slightly lower proportion than non-Asians.

The binned group “OTHER”, which contains those with “unknown” and number of other ethncities that were input into MIMIC-III has a larger proprtion of death than those identifying as other races.

We can look at the crude odds ratio for ethnicity.

cohort$ethnicity <- as.factor(cohort$ethnicity)
cohort$ethnicity <- relevel(cohort$ethnicity, ref = "WHITE")
MIMICbook::plot_OR_by_level(cohort, "ethnicity", "hospital_expire_flag") + ggtitle("OR Death, by Ethncity") +
  xlab("Ethnicity")

And also , and also stratify on gender.

MIMICbook::plot_OR_by_level(cohort %>% mutate(gender = ifelse(male_gender == 1, "Male", "Female")), "ethnicity", "hospital_expire_flag", factor.var2 = "gender") + ggtitle("OR Death, by Gender and Ethncity") +
  xlab("Gender") + scale_color_brewer(palette = "Set1", name = "Ethnicity")

Interestingly, the CI suggest we lose significance when we stratify ethnicity on gender.

There are certainly differences by ethnicity.

For age we’ll examine the proprtion of death by quntile of age.

groups <- 5
age_quantile <- with(cohort, cut(age, 
                                breaks = quantile(age, 
                                                  probs = seq(0,1, by = (1/groups)), 
                                                  na.rm=TRUE), 
                                include.lowest = TRUE))

cohort %>% mutate(age_group = age_quantile) %>% group_by(age_group) %>%
  summarise(prop_died = mean(hospital_expire_flag)) %>%
  ggplot() + geom_point(aes(x = age_group, y = prop_died)) + 
  xlab("Age Quintile") + ylab("Proportion Expired") + 
  geom_text(aes(x = age_group, y = prop_died+.005, label = round(prop_died, 2))) +
  ggtitle("Age Group vs. Proportion Dead") + 
  geom_smooth(aes(x = age_group, y = prop_died, group = 1), method = c("lm"))

There appears to be a linear relationship between age group and proportion dying, as we would expect. We can repeat this graph and split by gender to see if we would expect an interaction.

cohort %>% mutate(age_group = age_quantile) %>% group_by(age_group, male_gender) %>%
  summarise(prop_died = mean(hospital_expire_flag)) %>%
  ggplot() + geom_point(aes(x = age_group, y = prop_died, color = as.factor(male_gender))) + 
  xlab("Age Quintile") + ylab("Proportion Expired") + 
  geom_text(aes(x = age_group, y = prop_died+.005, label = round(prop_died, 2))) +
  ggtitle("Age Group vs. Proportion Dead, Examining Gender Interaction") + 
  scale_color_discrete(labels = c("Male", "Female"), name = "Gender") + 
  geom_smooth(aes(x = age_group, y = prop_died, group = male_gender), method = c("lm"))

The linear relationship holds, and the lines look fairly parallel suggesting interaction is likely not present and doesn’t need to be explored.

We can also look at this by ethnicity, and ethnicity stratified by gender.

Overall, the trend holds, with slightly varying slopes between groups. I suspect this is secondary to the sample size and is not suggestive of effect modification.

Lastly, lets look at Elixhauser.

groups <- 5
elix_quantile <- with(cohort, cut(elixhauser, 
                                breaks = quantile(elixhauser, 
                                                  probs = seq(0,1, by = (1/groups)), 
                                                  na.rm=TRUE), 
                                include.lowest = TRUE))

cohort %>% mutate(elix_group = elix_quantile) %>% group_by(elix_group) %>%
  summarise(prop_died = mean(hospital_expire_flag)) %>%
  ggplot() + geom_point(aes(x = elix_group, y = prop_died)) + 
  xlab("Elixhauser Quintile") + ylab("Proportion Expired") + 
  geom_text(aes(x = elix_group, y = prop_died+.005, label = round(prop_died, 2))) +
  ggtitle("Elixhauser Group vs. Proportion Dead") + 
  geom_smooth(aes(x = elix_group, y = prop_died, group = 1), method = c("lm"))

As we would expect, mortality correlates strongly with Elixhauser.

Before proceeding, we can remove the ethnicity variable we saved at the front of the dataset since we will use the dummy coded version going forward.

cohort <- cohort %>% select(-ethnicity)

3. How are the variables associated with outcome?

Before we get specifically to the association between the variables and outcome, lets see what the distribution of the continuous variables looks like.

melt(cohort %>% select(-hospital_expire_flag, -male_gender, -white_eth, 
                       -black_eth, -hispanic_eth, -asian_eth, -other_eth, -vent, -rrt), 
     id.vars = "hadm_id") %>%
  ggplot() + geom_histogram(aes(x = value)) + facet_wrap(~variable, scales = "free")
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## Warning: Removed 73657 rows containing non-finite values (stat_bin).

This figure has a lot going on, but allows us to quickly scan for the appearance of normality and skew. As we would expect, many parameters, such as heart rate, which are clinically normal on either side of the mean, appear normally distributed. Parameters for which only one side of the mean is considered normal, such as BUN, appear quite skewed.

Regardless, we can use tests of association that assume normality. In most cases it will be appropriate because of the shape of the distribution, but where it is not, we can justify such tests by the number of patients in our cohort in reliance of the central limit theorem (CLT).

We can very eaily examine the association between the exposure and outcome using the package Table 1, and stratifying by the outcome.

CreateTableOne(vars = colnames(cohort %>% select(-hadm_id, -hospital_expire_flag)),
               data = cohort, strata = "hospital_expire_flag")
##                              Stratified by hospital_expire_flag
##                               0                 1                 p     
##   n                             11183              2912                 
##   male_gender (mean (sd))        0.53 (0.50)       0.54 (0.50)     0.119
##   white_eth (mean (sd))          0.74 (0.44)       0.72 (0.45)     0.192
##   black_eth (mean (sd))          0.10 (0.30)       0.07 (0.25)    <0.001
##   hispanic_eth (mean (sd))       0.03 (0.18)       0.02 (0.15)     0.003
##   asian_eth (mean (sd))          0.03 (0.16)       0.02 (0.15)     0.512
##   other_eth (mean (sd))          0.11 (0.31)       0.16 (0.37)    <0.001
##   vent (mean (sd))               0.51 (0.50)       0.62 (0.49)    <0.001
##   rrt (mean (sd))                0.04 (0.20)       0.06 (0.24)    <0.001
##   elixhauser (mean (sd))        10.65 (8.15)      14.31 (8.57)    <0.001
##   mingcs (mean (sd))            13.62 (2.66)      13.04 (3.38)    <0.001
##   age (mean (sd))               65.98 (16.68)     70.48 (15.34)   <0.001
##   heartrate_min (mean (sd))     73.21 (15.46)     75.04 (18.52)   <0.001
##   heartrate_max (mean (sd))    106.48 (21.30)    112.99 (24.52)   <0.001
##   heartrate_mean (mean (sd))    87.58 (16.23)     92.23 (18.13)   <0.001
##   sysbp_min (mean (sd))         89.80 (17.28)     81.80 (19.71)   <0.001
##   sysbp_max (mean (sd))        149.61 (24.89)    145.50 (25.99)   <0.001
##   sysbp_mean (mean (sd))       117.41 (16.76)    111.26 (17.24)   <0.001
##   diasbp_min (mean (sd))        42.44 (11.52)     38.38 (12.26)   <0.001
##   diasbp_max (mean (sd))        84.51 (18.90)     82.41 (20.77)   <0.001
##   diasbp_mean (mean (sd))       59.86 (10.61)     56.88 (10.70)   <0.001
##   meanbp_min (mean (sd))        56.27 (13.62)     51.10 (15.28)   <0.001
##   meanbp_max (mean (sd))       105.58 (28.41)    105.40 (33.63)    0.766
##   meanbp_mean (mean (sd))       76.71 (11.13)     73.40 (11.46)   <0.001
##   resprate_min (mean (sd))      12.95 (3.86)      13.63 (4.41)    <0.001
##   resprate_max (mean (sd))      28.18 (6.71)      30.19 (7.42)    <0.001
##   resprate_mean (mean (sd))     19.64 (4.23)      21.21 (4.78)    <0.001
##   tempc_min (mean (sd))         36.14 (0.84)      35.89 (1.04)    <0.001
##   tempc_max (mean (sd))         37.63 (0.86)      37.48 (1.02)    <0.001
##   tempc_mean (mean (sd))        36.90 (0.68)      36.71 (0.86)    <0.001
##   spo2_min (mean (sd))          91.34 (7.81)      87.44 (13.56)   <0.001
##   spo2_max (mean (sd))          99.62 (0.93)      99.35 (2.41)    <0.001
##   spo2_mean (mean (sd))         97.22 (2.07)      96.01 (4.92)    <0.001
##   aniongap_min (mean (sd))      12.92 (3.20)      14.67 (4.43)    <0.001
##   aniongap_max (mean (sd))      16.32 (4.57)      18.50 (5.90)    <0.001
##   albumin_min (mean (sd))        2.98 (0.66)       2.68 (0.70)    <0.001
##   albumin_max (mean (sd))        3.07 (0.65)       2.81 (0.70)    <0.001
##   bands_min (mean (sd))          8.94 (9.10)      10.17 (10.80)    0.001
##   bands_max (mean (sd))         11.31 (11.09)     12.94 (12.99)   <0.001
##   bicarbonate_min (mean (sd))   21.96 (5.37)      20.04 (6.20)    <0.001
##   bicarbonate_max (mean (sd))   24.91 (5.02)      23.44 (5.69)    <0.001
##   bilirubin_min (mean (sd))      1.54 (3.84)       3.22 (6.45)    <0.001
##   bilirubin_max (mean (sd))      1.80 (4.27)       3.75 (7.16)    <0.001
##   creatinine_min (mean (sd))     1.45 (1.41)       1.67 (1.29)    <0.001
##   creatinine_max (mean (sd))     1.79 (1.72)       2.00 (1.48)    <0.001
##   chloride_min (mean (sd))     102.15 (6.81)     101.94 (7.39)     0.146
##   chloride_max (mean (sd))     107.26 (6.94)     107.22 (7.54)     0.783
##   glucose_min (mean (sd))      113.15 (39.11)    113.61 (46.36)    0.586
##   glucose_max (mean (sd))      184.50 (109.07)   190.38 (103.31)   0.009
##   hematocrit_min (mean (sd))    29.29 (5.95)      28.81 (6.18)    <0.001
##   hematocrit_max (mean (sd))    34.67 (5.87)      34.25 (6.11)     0.001
##   hemoglobin_min (mean (sd))     9.85 (2.03)       9.60 (2.06)    <0.001
##   hemoglobin_max (mean (sd))    11.50 (2.04)      11.28 (2.08)    <0.001
##   lactate_min (mean (sd))        1.61 (0.96)       2.50 (2.16)    <0.001
##   lactate_max (mean (sd))        2.82 (2.17)       4.37 (3.90)    <0.001
##   platelet_min (mean (sd))     209.74 (120.67)   186.29 (130.37)  <0.001
##   platelet_max (mean (sd))     254.86 (137.65)   232.92 (148.37)  <0.001
##   potassium_min (mean (sd))      3.76 (0.58)       3.86 (0.67)    <0.001
##   potassium_max (mean (sd))      4.67 (0.97)       4.82 (1.02)    <0.001
##   ptt_min (mean (sd))           31.76 (11.25)     35.97 (15.35)   <0.001
##   ptt_max (mean (sd))           43.97 (29.46)     52.32 (34.98)   <0.001
##   inr_min (mean (sd))            1.43 (0.70)       1.66 (1.09)    <0.001
##   inr_max (mean (sd))            1.76 (1.57)       2.20 (2.12)    <0.001
##   pt_min (mean (sd))            15.45 (5.67)      17.09 (7.99)    <0.001
##   pt_max (mean (sd))            17.81 (10.59)     20.67 (14.01)   <0.001
##   sodium_min (mean (sd))       136.53 (5.44)     136.28 (6.31)     0.033
##   sodium_max (mean (sd))       140.33 (5.37)     140.27 (6.10)     0.617
##   bun_min (mean (sd))           28.86 (22.64)     38.81 (26.94)   <0.001
##   bun_max (mean (sd))           34.51 (26.09)     44.64 (29.93)   <0.001
##   wbc_min (mean (sd))           10.90 (7.21)      12.78 (16.25)   <0.001
##   wbc_max (mean (sd))           14.72 (10.60)     17.10 (22.97)   <0.001
##   po2 (mean (sd))              170.14 (123.16)   152.28 (114.45)  <0.001
##   pco2 (mean (sd))              44.50 (16.44)     42.79 (17.04)   <0.001
##   ph (mean (sd))                 7.35 (0.11)       7.33 (0.13)    <0.001
##   urineoutput (mean (sd))     1838.06 (1346.52) 1195.46 (1081.46) <0.001
##                              Stratified by hospital_expire_flag
##                               test
##   n                               
##   male_gender (mean (sd))         
##   white_eth (mean (sd))           
##   black_eth (mean (sd))           
##   hispanic_eth (mean (sd))        
##   asian_eth (mean (sd))           
##   other_eth (mean (sd))           
##   vent (mean (sd))                
##   rrt (mean (sd))                 
##   elixhauser (mean (sd))          
##   mingcs (mean (sd))              
##   age (mean (sd))                 
##   heartrate_min (mean (sd))       
##   heartrate_max (mean (sd))       
##   heartrate_mean (mean (sd))      
##   sysbp_min (mean (sd))           
##   sysbp_max (mean (sd))           
##   sysbp_mean (mean (sd))          
##   diasbp_min (mean (sd))          
##   diasbp_max (mean (sd))          
##   diasbp_mean (mean (sd))         
##   meanbp_min (mean (sd))          
##   meanbp_max (mean (sd))          
##   meanbp_mean (mean (sd))         
##   resprate_min (mean (sd))        
##   resprate_max (mean (sd))        
##   resprate_mean (mean (sd))       
##   tempc_min (mean (sd))           
##   tempc_max (mean (sd))           
##   tempc_mean (mean (sd))          
##   spo2_min (mean (sd))            
##   spo2_max (mean (sd))            
##   spo2_mean (mean (sd))           
##   aniongap_min (mean (sd))        
##   aniongap_max (mean (sd))        
##   albumin_min (mean (sd))         
##   albumin_max (mean (sd))         
##   bands_min (mean (sd))           
##   bands_max (mean (sd))           
##   bicarbonate_min (mean (sd))     
##   bicarbonate_max (mean (sd))     
##   bilirubin_min (mean (sd))       
##   bilirubin_max (mean (sd))       
##   creatinine_min (mean (sd))      
##   creatinine_max (mean (sd))      
##   chloride_min (mean (sd))        
##   chloride_max (mean (sd))        
##   glucose_min (mean (sd))         
##   glucose_max (mean (sd))         
##   hematocrit_min (mean (sd))      
##   hematocrit_max (mean (sd))      
##   hemoglobin_min (mean (sd))      
##   hemoglobin_max (mean (sd))      
##   lactate_min (mean (sd))         
##   lactate_max (mean (sd))         
##   platelet_min (mean (sd))        
##   platelet_max (mean (sd))        
##   potassium_min (mean (sd))       
##   potassium_max (mean (sd))       
##   ptt_min (mean (sd))             
##   ptt_max (mean (sd))             
##   inr_min (mean (sd))             
##   inr_max (mean (sd))             
##   pt_min (mean (sd))              
##   pt_max (mean (sd))              
##   sodium_min (mean (sd))          
##   sodium_max (mean (sd))          
##   bun_min (mean (sd))             
##   bun_max (mean (sd))             
##   wbc_min (mean (sd))             
##   wbc_max (mean (sd))             
##   po2 (mean (sd))                 
##   pco2 (mean (sd))                
##   ph (mean (sd))                  
##   urineoutput (mean (sd))

With each association calculated, we can see which variables were signicantly associated at an alpha cutoff of <0.05. 59 of the 66 continuous variables are associated with hospital death. As we can see, vent and rrt are associated, and some, but not all of the race categories are associated. Male gender was not associated.

4. Is there an underlying structure to the data?

We can start by simply asking, are there any significant correlations? We can automate this search with a loop (adapted from StackOverflow). We’ll use a p-value cutoff of 0.5 for signifance, and only look for “strong” relationships where |r| > 0.7. We’ll also view the output of our scan as a plot, asking ggplot2 to fit a regression line for all of the associations that meet criteria.

Note: we’ll use some simple string detection to try and avoid things that are obviously correlated like heartrate_max and heartrate_mean.

cont_var <- colnames(cohort %>% select(-hadm_id, -hospital_expire_flag, -male_gender, -white_eth, -black_eth, 
-hispanic_eth, -asian_eth, -other_eth, 
-vent, -rrt))
correlations <- rcorr(as.matrix(cohort[, cont_var]))
m <- length(cont_var)
for (i in 1:m) {
  for (j in 1:m) {
    if (!is.na(correlations$P[i,j])) {
      if (correlations$P[i,j] < 0.05 & abs(correlations$r[i,j]) > 0.7) {
        xparam <- str_split(rownames(correlations$P)[i], "_")[[1]][1]
        yparam <- str_split(rownames(correlations$P)[j], "_")[[1]][1]
        if (!identical(xparam, yparam)) {
          p <- ggplot(cohort) + 
            geom_point(aes(x = get(rownames(correlations$P)[i]), y = get(colnames(correlations$P)[j]))) +
            xlab(rownames(correlations$P)[i]) + ylab(colnames(correlations$P)[j]) +
            geom_smooth(aes(x = get(rownames(correlations$P)[i]), y = get(colnames(correlations$P)[j])), method = c("lm"))
          print(p)
        }
      }
    }
  }
}
## Warning: Removed 9 rows containing non-finite values (stat_smooth).
## Warning: Removed 9 rows containing missing values (geom_point).

## Warning: Removed 11 rows containing non-finite values (stat_smooth).
## Warning: Removed 11 rows containing missing values (geom_point).

## Warning: Removed 10 rows containing non-finite values (stat_smooth).
## Warning: Removed 10 rows containing missing values (geom_point).

## Warning: Removed 11 rows containing non-finite values (stat_smooth).
## Warning: Removed 11 rows containing missing values (geom_point).

## Warning: Removed 9 rows containing non-finite values (stat_smooth).
## Warning: Removed 9 rows containing missing values (geom_point).

## Warning: Removed 10 rows containing non-finite values (stat_smooth).
## Warning: Removed 10 rows containing missing values (geom_point).

## Warning: Removed 33 rows containing non-finite values (stat_smooth).
## Warning: Removed 33 rows containing missing values (geom_point).

## Warning: Removed 60 rows containing non-finite values (stat_smooth).
## Warning: Removed 60 rows containing missing values (geom_point).

## Warning: Removed 60 rows containing non-finite values (stat_smooth).

## Warning: Removed 60 rows containing missing values (geom_point).

## Warning: Removed 60 rows containing non-finite values (stat_smooth).

## Warning: Removed 60 rows containing missing values (geom_point).

## Warning: Removed 60 rows containing non-finite values (stat_smooth).

## Warning: Removed 60 rows containing missing values (geom_point).

## Warning: Removed 1177 rows containing non-finite values (stat_smooth).
## Warning: Removed 1177 rows containing missing values (geom_point).

## Warning: Removed 1177 rows containing non-finite values (stat_smooth).

## Warning: Removed 1177 rows containing missing values (geom_point).

## Warning: Removed 1177 rows containing non-finite values (stat_smooth).

## Warning: Removed 1177 rows containing missing values (geom_point).

## Warning: Removed 1177 rows containing non-finite values (stat_smooth).

## Warning: Removed 1177 rows containing missing values (geom_point).

## Warning: Removed 33 rows containing non-finite values (stat_smooth).
## Warning: Removed 33 rows containing missing values (geom_point).

The only strong significant associations are between variables which are, by definition or nature, linked: 1) Systolic and diastolic pressure define MAP 2) Sodium and chloride are intrinsically linked as a cation-anion pair and stay in balance. 3) Hemoglobin is a measure of the concentration of hemoglobin in the blood, and hematocrit is the fraction of the blood volume which is red blood cells. As such, they are intrinsically linked by definition. 4) PT and PTT test the two main pathways in the coagulation cascade (extrinsic, and intrinsic), and INR is a normalized version of PT done at each laboratory in order to control for the susceptibility of PT to chemical factors in each laboratory. Thus INR and PT are intrinsically linked by definition.

With respect to point 4, we do note two seperate lines on the graphs for PT and INR. Because of the way INR is derived, this is likely capturing a change that occurred in the laboratory standard, and doesn’t warrant further exploration.

Now lets perform a principle component analysis.

cohort_mat <- as.matrix(cohort[,3:76])
cohort_mat[is.na(cohort_mat)] <- 0 # An unfortunate requirement of PCA
cohort_pca <- prcomp(cohort_mat, center = TRUE, scale = TRUE)
autoplot(cohort_pca, x = 1, y = 2)

autoplot(cohort_pca, x = 3, y = 4)

Looking at the first 4 PCs we see that while there appears to be some group separation, the groups that separate off are quite small. As this PCA was performed to explore the data and not to specifically improve model buidling, and since the discriminatory effect appears minimal, this approach won’t be developed further.

In our EDA we’ve learned the following: 1) Demographic factors seem to matter for hospital mortality. 2) Most of the variables are univariately associated with the outcome. 3) There does not appear to be a significant amount of correlation between variables past the obvious associations discussed. 4) The first few PC’s obtained from PCA do not very clearly separate the populations, although future work could attempt to identify what the little off-shoots are.

With respect to point 2), some non-significant variables are worth excluding, but not all of them. Chloride_min and max can go; they’re not associated and chloride_mean will remain. However, mean_bp_max, glucose_min, sodium_min, and sodium_max all seem clinically important, and will be kept at least until their lack of importance has been demonstrated in the GBM importance matrix. In addition, Elixhauser, while not predicting death, is expected to be important in stratifying patients because it is the comorbidity burden, and should be kept for examination in the model. Gender did not seem important, but we can also let this be examined in the model, and since some of the race dummy vars were significant, we’ll keep them all.

cohort <- cohort %>% select(-chloride_min, -chloride_max)

With respect to point 3) we should remove variables that are “redundant” by nature of deinition. As such, we’ll keep MAP dropping diastolic and systolic pressures, INR dropping PT, and hemoglobin dropping hematocrit.

cohort <- cohort %>% select(-diasbp_min, -diasbp_max, -diasbp_mean, -sysbp_min,
                            -sysbp_max, -sysbp_mean, -pt_min, -pt_max, 
                            -hematocrit_min, -hematocrit_max)

Finally, before we move onto building our model, lets revist the amount of NA present in our data and see if it is worth not using a varaible that has substantial missing data, and could thus do more harm to model fitting than good.

sapply(cohort, function(x) sum(is.na(x)))
##              hadm_id hospital_expire_flag          male_gender 
##                    0                    0                    0 
##            white_eth            black_eth         hispanic_eth 
##                    0                    0                    0 
##            asian_eth            other_eth                 vent 
##                    0                    0                    0 
##                  rrt           elixhauser               mingcs 
##                    0                    0                   15 
##                  age        heartrate_min        heartrate_max 
##                    0                    0                    1 
##       heartrate_mean           meanbp_min           meanbp_max 
##                    0                    0                    0 
##          meanbp_mean         resprate_min         resprate_max 
##                    0                   16                   11 
##        resprate_mean            tempc_min            tempc_max 
##                   11                  167                  167 
##           tempc_mean             spo2_min             spo2_max 
##                  167                   13                   14 
##            spo2_mean         aniongap_min         aniongap_max 
##                   13                   96                   96 
##          albumin_min          albumin_max            bands_min 
##                 6889                 6889                10780 
##            bands_max      bicarbonate_min      bicarbonate_max 
##                10779                   54                   53 
##        bilirubin_min        bilirubin_max       creatinine_min 
##                 5328                 5328                   37 
##       creatinine_max          glucose_min          glucose_max 
##                   37                   31                   31 
##       hemoglobin_min       hemoglobin_max          lactate_min 
##                   60                   60                 3335 
##          lactate_max         platelet_min         platelet_max 
##                 3335                   67                   67 
##        potassium_min        potassium_max              ptt_min 
##                   27                   27                 1224 
##              ptt_max              inr_min              inr_max 
##                 1224                 1177                 1177 
##           sodium_min           sodium_max              bun_min 
##                   29                   28                   38 
##              bun_max              wbc_min              wbc_max 
##                   38                   71                   71 
##                  po2                 pco2                   ph 
##                 4003                 4004                 4005 
##          urineoutput 
##                    0

While some of the blood gas associated values are missing for a litle under a third of the cohort, the major issues are bands, albumin, and bilirubin. Clinically, bands are an excellent indicator of an infectious process, but they will probably not help us much in our modeling, especially in a cohort with infection required as an inclusion criteria. Furthermore, bilirubin and albumin have clinical utility, but are not routinely collected on initial admission unless the history of present illness suggest their necessity, and thus they can also be removed.

cohort <- cohort %>% select(-bands_min, -bands_max, -bilirubin_min, 
                            -bilirubin_max, -albumin_min, -albumin_max)

We’ll keep the ABG associated tests, despite them missing for so many, because ABG’s tell us so much about acid-base physiology which is important in all medicine, and especially critical care. That said, if they are not found to be important to the GBM, we will remove them.

With that, we are ready to move onto model building.

Final Analysis

First, lets make a training set.

set.seed(10)
indices <- createDataPartition(y = cohort$hospital_expire_flag, p = 0.9)$Resample

train <- cohort %>% dplyr::slice(indices)
test <- cohort %>% dplyr::slice(-indices)

These data need to be converted to matrices in order to be used with xgboost.

train_label <- train$hospital_expire_flag
test_label <- test$hospital_expire_flag

train_mat <- as.matrix(train[,3:58])
test_mat <- as.matrix(test[,3:58])

train_xgmat <- xgb.DMatrix(data = train_mat, label = train_label)
test_xgmat <- xgb.DMatrix(data = test_mat, label = test_label)

We begin by setting starting paramters. These starting points were chosen based on prior work and reading.

params <- list(booster = "gbtree", objective = "binary:logistic",
               eta = 0.3, gamma = 0, max_depth = 10, min_child_weight = 3, 
               subsample = 0.5, colsample_bytree = 0.5)

Using these baseline parameters we’ll build our first mode.

mort_gbm1 <- xgb.train(params = params, data = train_xgmat, nround = 100, 
                       watchlist = list(train = train_xgmat), 
                       print_every_n = 10, early_stopping_rounds = 50, 
                       maximize = T, eval_metric = "auc")
## [1]  train-auc:0.754550 
## Will train until train_auc hasn't improved in 50 rounds.
## 
## [11] train-auc:0.900717 
## [21] train-auc:0.942498 
## [31] train-auc:0.969411 
## [41] train-auc:0.983981 
## [51] train-auc:0.991971 
## [61] train-auc:0.995228 
## [71] train-auc:0.998212 
## [81] train-auc:0.999105 
## [91] train-auc:0.999659 
## [100]    train-auc:0.999916

We now evaluate this first model.

pred <- predict(mort_gbm1, test_xgmat)
roc_gbm1 <- roc(as.numeric(pred > 0.5), test_label)
roc_gbm1$auc
## Area under the curve: 0.6912
ggroc(roc_gbm1) + geom_abline(slope = 1, intercept = 1)

confusionMatrix(as.numeric(pred > 0.5), test_label)
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction    0    1
##          0 1031  200
##          1   81   97
##                                           
##                Accuracy : 0.8006          
##                  95% CI : (0.7787, 0.8211)
##     No Information Rate : 0.7892          
##     P-Value [Acc > NIR] : 0.1556          
##                                           
##                   Kappa : 0.2974          
##  Mcnemar's Test P-Value : 1.932e-12       
##                                           
##             Sensitivity : 0.9272          
##             Specificity : 0.3266          
##          Pos Pred Value : 0.8375          
##          Neg Pred Value : 0.5449          
##              Prevalence : 0.7892          
##          Detection Rate : 0.7317          
##    Detection Prevalence : 0.8737          
##       Balanced Accuracy : 0.6269          
##                                           
##        'Positive' Class : 0               
## 

This first model is rather specific, but poorly sensitive. This follows logically given the prevalence of death in the cohort. However, we saw that the training AUC was quite good; we can introduce regularization to help with this issue.

While a grid search of the parameter space is the oft recommended approach to hyperparameter tuning, it is rather inefficent. In addition to the computational burden, it has been suggested in the literature that it doesn’t make sense given that most hyperparameters “do not matter much.” Bergstra and Benigo describe this in the Journal of Machine Learning.

Instead, we’ll draw 1,000 random samples from the parameter space. We’ll optimize gamma, the regularization parameter, as well as the learning rate (eta), subsample, sample by tree, our tree depth, and our the minimum child weight.

note: I’ve set this to eval false for knitting since I recorded the values I obtained when initial coding this I’ve done this because runing it takes more than an hour

parallelStartSocket(cpus = detectCores())
parallelLibrary("mlr")
traintask <- makeClassifTask (data = train, target = "hospital_expire_flag")
testtask <- makeClassifTask (data = test, target = "hospital_expire_flag")

xgb_lrn <- makeLearner("classif.xgboost", predict.type = "response")
xgb_lrn$par.vals <- list(booster = "gbtree", objective = "binary:logistic", 
                         nrounds = 50L, eval_metric = "auc")

params <- makeParamSet(makeIntegerParam("max_depth", lower = 10L, upper = 15L),
                       makeIntegerParam("gamma", lower = 10L, upper = 20L),
                       makeIntegerParam("min_child_weight", lower = 15L, upper = 25L),
                       makeNumericParam("eta", lower = 0.2, upper = 0.3),
                       makeNumericParam("subsample", lower = 0.3, upper = 0.5), 
                       makeNumericParam("colsample_bytree", lower = 0.3, upper = 0.5))

resamp <- makeResampleDesc("CV", stratify = T, iters = 5L)
tune_control <- makeTuneControlRandom(maxit = 1000)

tune <- tuneParams(learner = xgb_lrn, task = traintask, resampling = resamp, 
                   measures = acc, par.set = params, control = tune_control, 
                   show.info = T)
parallelStop()

tuned_lrn <- setHyperPars(xgb_lrn, par.vals = tune$x) # assign the optimal parameters
mort_gbm2 <- mlr::train(learner = tuned_lrn, task = traintask)

Because the above search is random, and it was tweaked and run multiple times initially, re-compilation of this file may result in changes to the output. Becuase I wanted to write up this work, I’ve manually saved the output I recieved.

The initial output I recieved and carried forward in my analysis is as follows:

max_depth = 10 gamma = 14 min_child_weight = 25 eta = 0.267 subsample = 0.382 colsample_bytree = 0.443

params <- list(booster = "gbtree", objective = "binary:logistic",
               eta = 0.267, gamma = 14, max_depth = 10, min_child_weight = 25, 
               subsample = 0.382, colsample_bytree = 0.443)

We have yet to optimize our iteration count, and so finally, we’ll use 5-fold cross validation to set that.

nround_cv <- xgb.cv(params = params, data = train_xgmat, nrounds = 200, 
                  nfold = 5, showsd = T, stratified = T, print_every_n = 10,
                  early_stopping_rounds = 50, maximize = F, metrics = "error")
## [1]  train-error:0.205167+0.004108   test-error:0.209284+0.010468 
## Multiple eval metrics are present. Will use test_error for early stopping.
## Will train until test_error hasn't improved in 50 rounds.
## 
## [11] train-error:0.190328+0.001622   test-error:0.195017+0.008555 
## [21] train-error:0.184790+0.001665   test-error:0.190840+0.007913 
## [31] train-error:0.183687+0.001622   test-error:0.190367+0.007907 
## [41] train-error:0.182859+0.002162   test-error:0.190288+0.007712 
## [51] train-error:0.183174+0.001961   test-error:0.189421+0.008163 
## [61] train-error:0.182859+0.001787   test-error:0.190130+0.007444 
## [71] train-error:0.182248+0.002187   test-error:0.189500+0.006770 
## [81] train-error:0.182090+0.002154   test-error:0.189105+0.006296 
## [91] train-error:0.181775+0.002330   test-error:0.190052+0.006154 
## [101]    train-error:0.181972+0.002575   test-error:0.190130+0.006109 
## [111]    train-error:0.181184+0.002523   test-error:0.189894+0.005099 
## Stopping. Best iteration:
## [69] train-error:0.182347+0.002121   test-error:0.188712+0.007325
mort_gbm2 <- xgb.train(params = params, data = train_xgmat, nround = nround_cv$best_iteration, 
                       watchlist = list(train = train_xgmat), 
                       print_every_n = 10, early_stopping_rounds = 50, 
                       maximize = T, eval_metric = "auc")
## [1]  train-auc:0.656760 
## Will train until train_auc hasn't improved in 50 rounds.
## 
## [11] train-auc:0.756205 
## [21] train-auc:0.763559 
## [31] train-auc:0.769669 
## [41] train-auc:0.772306 
## [51] train-auc:0.772530 
## [61] train-auc:0.774704 
## [69] train-auc:0.776851
pred <- predict(mort_gbm2, test_xgmat)
roc <- roc(as.numeric(pred > 0.5), test_label)
ggroc(roc) + geom_abline(slope = 1, intercept = 1)

roc$auc
## Area under the curve: 0.7463

Our final model achieves an AUROC of 0.7463, just under the AUROC of SOFA. It is worth noting that our initial model had an AUC of 0.6912, and so tuning was very useful.

imp_mat <- xgb.importance(feature_names = colnames(train_xgmat), model = mort_gbm2)
xgb.ggplot.importance(imp_mat)

The features ultimately selected are shown above. The selected features make a great deal of sense: burden of comorbidity and age lead the pack. Our EDA demonstrated the importance of these features. It is also noteworthy that ethnicity emerged, as our EDA might have predicted.

The most noteworthy conclusion of this initial work is that only physiological data and clinical assessments (Elixhauser, GCS) were kept by the algorithim. Laboratory data were not useful, falling below even ethnicity.